A Lexicon for Underspecified Semantic Tagging

Author

  • Paul Buitelaar
Abstract

The paper defends the notion that semantic tagging should be viewed as more than disambiguation between senses. Instead, semantic tagging should be a first step in the interpretation process by assigning each lexical item a representation of all of its systematically related senses, from which further semantic processing steps can derive discourse dependent interpretations. This leads to a new type of semantic lexicon (CoreLex) that supports underspecified semantic tagging through a design based on systematic polysemous classes and a class-based acquisition of lexical knowledge for specific domains.

1 Underspecified semantic tagging

Semantic tagging has mostly been considered as nothing more than disambiguation, to be performed along the same lines as part-of-speech tagging: given n lexical items, each with m senses, apply linguistic heuristics and/or statistical measures to pick the most likely sense for each lexical item (see e.g. (Yarowsky, 1992), (Stevenson and Wilks, 1997)). I do not believe this to be the right approach, because it blurs the distinction between ‘related’ senses (systematic polysemy) and ‘unrelated’ senses (homonymy: bank vs. bank). Although homonyms need to be tagged with a disambiguated sense, this is not necessarily so in the case of systematic polysemy. There are two reasons for this that I will discuss briefly here.

First, the problem of multiple reference. Consider this example from the Brown corpus:

[A long book heavily weighted with military technicalities]NP , in this edition it is neither so long nor so technical as it was originally.

The discourse marker (it) refers back to an NP that expresses more than one interpretation at the same time. The head of the NP (book) has a number of systematically related senses that are being expressed simultaneously.
The meaning of book in this sentence cannot be disambiguated between the interpretations that are implied: the informational content of the book (military technicalities), its physical appearance (heavily weighted) and the events that are involved in its construction and use (long). The example illustrates the fact that disambiguation between related senses is not always possible, which leads to the further question of whether a discrete distinction between such senses is desirable at all. A number of researchers have answered this question negatively (see e.g. (Pustejovsky, 1995), (Kilgarriff, 1992)).

Second, the problem of sense enumeration. Consider these examples from the Brown corpus:

(1) fast run-up (of the stock)
(2) fast action (by the city government)
(3) fast footwork (by Washington)
(4) fast weight gaining
(5) fast condition (of the track)
(6) fast response time
(7) fast people
(8) fast ball

Each use of the adjective ‘fast’ in these examples has a slightly different interpretation that could be captured in a number of senses, reflecting the different syntactic and semantic patterns. For instance:

1. ‘a fast action’ (1, 2, 3, 4)
2. ‘a fast state of affairs’ (5, 6)
3. ‘a fast object’ (7, 8)

On the other hand, all of the interpretations also have something in common, namely the idea of ‘speed’. It therefore seems useful to underspecify the lexical meaning of ‘fast’ to a representation that captures this primary semantic aspect and gives a general structure for its combination with other lexical items, both locally (in compositional semantics) and globally (in discourse structure).

Both the multiple reference problem and the sense enumeration problem show that lexical items mostly have an indefinite number of related but highly discourse dependent interpretations, which cannot be distinguished by semantic tagging alone. Instead, semantic tagging should be a first step in the interpretation process by assigning each lexical item a representation of all of its systematically related ‘senses’.
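The contrast drawn in this section between homonymy and systematic polysemy can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `SemanticTag`, `TOY_LEXICON`, `tag` and `choose_homonym`, and the sense labels themselves, are assumptions made for illustration only. A homonym such as bank must be resolved to one group of senses, whereas a systematically polysemous noun such as book receives a single tag that points at all of its related senses.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class SemanticTag:
    """Underspecified tag: points at all related senses of one lexical item."""
    lemma: str
    senses: frozenset


# Hypothetical toy lexicon (labels are illustrative, not taken from the paper).
# Each lemma maps to one or more groups of systematically related senses;
# more than one group means the lemma is a homonym.
TOY_LEXICON = {
    "book": (frozenset({"content", "physical object", "event"}),),
    "bank": (frozenset({"financial institution"}), frozenset({"river side"})),
}

# A chooser receives the lemma and its sense groups and returns a group index.
Chooser = Callable[[str, Sequence[frozenset]], int]


def tag(lemma: str, choose_homonym: Optional[Chooser] = None) -> SemanticTag:
    """Disambiguate only between unrelated groups; keep related senses together."""
    groups = TOY_LEXICON[lemma]
    if len(groups) == 1:
        # Systematic polysemy only: no disambiguation at tagging time.
        return SemanticTag(lemma, groups[0])
    if choose_homonym is None:
        raise ValueError(f"{lemma!r} is a homonym and needs disambiguation")
    return SemanticTag(lemma, groups[choose_homonym(lemma, groups)])


book_tag = tag("book")
bank_tag = tag("bank", lambda lemma, groups: 0)
```

In this sketch the choice between the related senses of book is deliberately left open, so that a later processing step with access to the discourse context could select or combine them.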
Further semantic processing steps derive discourse dependent interpretations from this representation. Semantic tags are therefore more like pointers to complex knowledge representations, which can be seen as underspecified lexical meanings.

2 CoreLex: A Semantic Lexicon with Systematic Polysemous Classes

In this section I describe the structure and content of a lexicon (CoreLex) that builds on the assumptions about lexical semantics and discourse outlined above. More specifically, it is to be ‘structured in such a way that it reflects the lexical semantics of a language in systematic and predictable ways’ (Pustejovsky, Boguraev, and Johnston, 1995). This assumption is fundamentally different from the design philosophies behind existing lexical semantic resources like WordNet that do not account for any regularities between senses. For instance, WordNet assigns to the noun book the following senses:

  • publication
  • product, production
  • fact
  • dramatic composition, dramatic work
  • record
  • section, subdivision
  • journal

Figure 1: WordNet senses for the noun book

At the top of the WordNet hierarchy these seven senses can be reduced to two unrelated ‘basic senses’: the content that is being communicated (communication) and the medium of communication (artifact). More accurately, book should be assigned a qualia structure which implies both of these interpretations and connects them to each of the more specific senses that WordNet assigns: that is, facts, drama and a journal can be part-of the content of a book; a section is part-of both the content and the medium; publication, production and recording are all events in which both the content and the medium aspects of a book can be involved.

An important advantage of the CoreLex approach is more consistency among the assignments of lexical semantic structure. Consider the senses that WordNet assigns to door, gate and window:


Similar resources

Semi-automatic Induction of Systematic Polysemy from WordNet

This paper describes a semi-automatic method of inducing underspecified semantic classes from WordNet verbs and nouns. An underspecified semantic class is an abstract semantic class which encodes systematic polysemy, a set of word senses that are related in systematic and predictable ways. We show the usefulness of the induced classes in the semantic interpretations and contextual inferences o...

Ensemble-based Semantic Lexicon Induction for Semantic Tagging

We present an ensemble-based framework for semantic lexicon induction that incorporates three diverse approaches for semantic class identification. Our architecture brings together previous bootstrapping methods for pattern-based semantic lexicon induction and contextual semantic tagging, and incorporates a novel approach for inducing semantic classes from coreference chains. The three methods ...

Underspecified Phonological Features for Lexical Access

The FUL (featurally underspecified lexicon) system of automatic speech recognition is based on the representation of words in the lexicon with underspecified distinctive features. The speech signal is converted from the waveform into an online spectral representation made up of LPC formants and a few parameters describing the overall spectral shape. These spectral parameters are converted into ...

Providing Robustness for a CCG System

We demonstrate ways to preserve the advantages of using a symbolic grammar formalism as the basis of an NLP system while enhancing its robustness. We automatically acquire a CCG lexicon, combine it with semantic and morphological information from another hand-built, underspecified lexicon, and integrate it with statistical preprocessing methods.

Sense Tagging: Semantic Tagging with a Lexicon

Sense tagging, the automatic assignment of the appropriate sense from some lexicon to each of the words in a text, is a specialised instance of the general problem of semantic tagging by category or type. We discuss which recent word sense disambiguation algorithms are appropriate for sense tagging. It is our belief that sense tagging can be carried out effectively by combining several simple, ...

Journal:
  • CoRR

Volume: cmp-lg/9705011

Publication date: 1997